How do Formula 1 constructors’ strategic choices across a racing season relate to their competitive success?
This project explores the high-dimensional relationship between strategic factors—such as participation, pit stops, and race performance—and final season outcomes by embedding these features into a two-dimensional principal component space and enabling interactive exploration.
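The dataset aggregates detailed Formula 1 constructor-level statistics across multiple seasons. Each record represents one constructor in one season and includes four features: average finishing position (avg_finish), average pit stop duration (avg_pit_time), number of podium finishes (podiums), and number of race entries (races).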
These features offer a nuanced view of a constructor’s engagement, efficiency, and competitiveness during a season.
Handling Missing Data
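Missing values were handled in two steps in the pipeline below:
per-group means skip missing entries (na.rm = TRUE), and
constructor-seasons that still lack a feature after aggregation (for
example, seasons with no recorded pit stop times) are removed with
drop_na(), so that absent data does not bias the performance metrics.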
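As a minimal sketch of the two common options for missing values, the base R snippet below uses a hypothetical toy table. The column names `pit_count` and `avg_pit_ms` and all values are assumptions for illustration, not columns from the project data. Filling a count with zero records an absence, whereas filling a duration with zero would read as an impossibly fast pit stop, so rows without a duration are dropped instead.

```r
# Hypothetical toy data: one row per constructor-season (values are made up)
toy <- data.frame(
  team       = c("a", "b", "c"),
  pit_count  = c(40, NA, 35),       # NA = no pit stops recorded
  avg_pit_ms = c(23000, NA, 24500)  # NA = no pit duration available
)

# Option 1: impute zero for a count, where "missing" means "none happened"
toy$pit_count[is.na(toy$pit_count)] <- 0

# Option 2: drop rows that still lack a duration; a zero here would be
# treated as a real (and impossibly fast) pit stop and distort the feature
complete <- toy[complete.cases(toy), ]

print(toy$pit_count)
print(as.character(complete$team))
```

Which option is appropriate depends on what a missing value means for that feature.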
Feature Scaling
Principal Component Analysis (PCA) is sensitive to the scale of input
features. All numeric features (e.g., average finishing position,
podiums, races, pit stops) were standardized to have mean zero and unit
variance before fitting the PCA model.
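To illustrate why standardization matters, the sketch below uses made-up numbers on very different scales (pit times in milliseconds versus podium counts). `scale()` and `prcomp()` are base R; the toy matrix `X` is an assumption for illustration only.

```r
# Hypothetical features on very different scales (values are made up)
X <- cbind(
  avg_pit_time = c(23000, 24500, 22000, 26000, 25000), # milliseconds
  podiums      = c(10, 2, 15, 0, 1)                    # counts
)

# Without scaling, the millisecond column dominates the first component
raw_pca <- prcomp(X)
print(round(abs(raw_pca$rotation[, 1]), 3))

# Standardize each column to mean 0 and standard deviation 1
Z <- scale(X)
print(round(colMeans(Z), 10))
print(apply(Z, 2, sd))

# prcomp(X, scale. = TRUE) standardizes internally and should agree with
# prcomp(scale(X)) up to the (arbitrary) sign of each component
scaled_pca <- prcomp(Z)
builtin    <- prcomp(X, scale. = TRUE)
print(all.equal(abs(scaled_pca$rotation), abs(builtin$rotation)))
```

Either form works; passing `scale. = TRUE` keeps the standardization and the fit in one call.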
Composite Metrics Construction
Fine-grained race results were aggregated into season-level composite
metrics (average finishing position, podium count, average pit stop
duration, and race entries) to capture a constructor’s overall seasonal
effectiveness. PCA then operates on patterns across these facets of
performance rather than on individual race results.
PCA Fitting
After scaling, the PCA model was fit to the prepared feature set. The
resulting principal components were interpreted based on feature
loadings to understand the meaning of the embedded dimensions.
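One check that usually accompanies this step is how much variance the first two components retain, since that determines how faithful a two-dimensional view can be. The sketch below computes it from `sdev` on a made-up matrix; the data and the object name `toy_pca` are assumptions for illustration, and the same calculation would apply to any fitted `prcomp` object.

```r
# Hypothetical inputs: 6 constructor-seasons, 4 features (values are made up)
toy <- cbind(
  avg_finish   = c(5.2, 12.1, 3.4, 15.8, 9.9, 7.3),
  avg_pit_time = c(23000, 24500, 22000, 26000, 25000, 23800),
  podiums      = c(10, 2, 15, 0, 1, 6),
  races        = c(38, 40, 38, 36, 40, 38)
)
toy_pca <- prcomp(toy, scale. = TRUE)

# The variance of each component is sdev^2; divide by the total for proportions
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cumulative    <- cumsum(var_explained)

print(round(var_explained, 3))
print(round(cumulative[2], 3)) # share of variance kept by PC1 + PC2
```

`summary(toy_pca)` reports the same proportions in its `importance` table.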
library(tidyverse)  # dplyr, tidyr, tibble

# Average pit stop duration (ms) per driver per race
pit_summary <- pit_stops %>%
  group_by(raceId, driverId) %>%
  summarize(avg_pit_duration = mean(milliseconds, na.rm = TRUE), .groups = "drop")

# Aggregate race results to one row per constructor-season
constructor_data <- results %>%
  left_join(pit_summary, by = c("raceId", "driverId")) %>%
  left_join(races, by = "raceId") %>%
  left_join(constructors, by = "constructorId") %>%
  group_by(year, constructorRef) %>%
  summarize(
    avg_finish = mean(positionOrder, na.rm = TRUE),
    avg_pit_time = mean(avg_pit_duration, na.rm = TRUE),
    podiums = sum(positionOrder <= 3, na.rm = TRUE),
    races = n(),  # counts result rows (driver entries), not distinct races
    .groups = "drop"
  ) %>%
  drop_na() # Remove rows with missing values

constructor_data <- constructor_data %>%
  mutate(label = paste(constructorRef, year, sep = ""))

# Standardize the four features, then fit PCA
pca_model <- constructor_data %>%
  select(avg_finish, avg_pit_time, podiums, races) %>%
  scale() %>%
  prcomp()

# Attach the PC scores to the constructor-season rows
pca_df <- as_tibble(pca_model$x) %>%
  bind_cols(constructor_data)
print(pca_model$rotation)
##                     PC1        PC2         PC3         PC4
## avg_finish    0.6551643  0.2615229  0.03852060  0.70772996
## avg_pit_time -0.2738251  0.6529871 -0.70432559  0.05052837
## podiums      -0.6452645 -0.2941548  0.02868886  0.70447400
## races        -0.2818035  0.6470600  0.70825036 -0.01677982
Interpretation:
PC1: avg_finish (positive, 0.655) and podiums (negative, -0.645) have the largest loadings.
A lower avg_finish (a better average finishing position) and more podiums both indicate better performance.
So PC1 captures overall race success: better average finishes and more podiums correspond to lower PC1 scores. The sign of a principal component is arbitrary, so what matters is that the two loadings point in opposite directions.
PC2: avg_pit_time (positive, 0.653) and races (positive, 0.647) have the largest loadings.
Longer pit times and more race entries load together.
So PC2 captures a pit and participation effect: constructor-seasons with more race entries and longer average pit times score higher on PC2.
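To make the sign convention concrete, the sketch below projects one hypothetical standardized constructor-season onto the PC1 and PC2 loadings printed above. The z-scores are made up for illustration: one standard deviation better than average on finishing position, one standard deviation above average on podiums, and exactly average on the other two features.

```r
# PC1 and PC2 loadings copied from the rotation matrix printed above
loadings <- cbind(
  PC1 = c(avg_finish = 0.6551643, avg_pit_time = -0.2738251,
          podiums = -0.6452645, races = -0.2818035),
  PC2 = c(avg_finish = 0.2615229, avg_pit_time = 0.6529871,
          podiums = -0.2941548, races = 0.6470600)
)

# Hypothetical standardized features (z-scores) for a strong constructor-season
z <- c(avg_finish = -1, avg_pit_time = 0, podiums = 1, races = 0)

# A PC score is the dot product of the z-scores with that component's loadings
scores <- drop(z %*% loadings)
print(round(scores, 4))
# PC1 = -0.6551643 - 0.6452645 = -1.3004, so a strong season lands at low PC1
```

Under these assumed inputs the point falls on the negative side of PC1, consistent with reading low PC1 as stronger race performance.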
library(ggplot2)
library(plotly)  # ggplotly() converts the ggplot into an interactive widget

p <- ggplot(pca_df, aes(x = PC1, y = PC2)) +
  geom_point(
    aes(
      size = podiums,
      color = avg_finish,
      # "text" is not a ggplot2 aesthetic; it is picked up by ggplotly() for the tooltip
      text = paste0(
        "Constructor: ", constructorRef, "<br>",
        "Season: ", year, "<br>",
        "Avg Finish: ", round(avg_finish, 2), "<br>",
        "Podiums: ", podiums
      )
    ),
    alpha = 0.8
  ) +
  scale_color_viridis_c(direction = -1, option = "plasma") +
  scale_size_continuous(name = "Podiums", range = c(1, 10)) +
  theme_minimal() +
  labs(
    title = "PCA of Constructor Strategy by Season",
    subtitle = "Color = Avg Finish (lower = better), Size = Podiums",
    x = "PC1: Composite Race Metrics",
    y = "PC2: Pit & Participation Effects",
    color = "Avg Finish",
    size = "Podiums"
  ) +
  theme(legend.position = "right")

ggplotly(p, tooltip = "text") %>%
  layout(
    showlegend = TRUE
  )
Projection Technique
PCA was selected to summarize high-dimensional constructor performance
and strategy into a two-dimensional, human-interpretable space.
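Point Encoding
Each point is one constructor-season. Point size encodes the number of
podiums, and color encodes average finishing position (lower is better).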
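Interactivity
Hovering over a point shows a tooltip with the constructor, season,
average finish, and podium count.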
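Axis Interpretation
PCA loadings were used to interpret the axes: PC1 reflects overall race
success (average finishing position and podiums), and PC2 reflects pit
stop duration and participation (number of race entries).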
Design Trade-offs
Reproducibility
All design elements were generated programmatically from the data,
ensuring consistency and reproducibility.
These findings challenge the assumption that participation or pit stop optimization alone leads to better outcomes. Instead, on-track race performance, measured here by average finishing position and podiums, remains the strongest driver of success.